Large Language Models, GPT-1 — Generative Pre-Trained Transformer | by Vyacheslav Efimov | Jan, 2024

January 27, 2024

Diving deeply into the working structure of the first version of the gigantic GPT models.

2017 was a historic year in machine learning. Researchers from the Google Brain team introduced the Transformer, which rapidly outperformed most of the existing approaches in deep learning. The famous attention mechanism became the key component of the models later derived from the Transformer. The amazing fact about the Transformer's architecture is its vast flexibility: it can be efficiently used for a variety of machine learning tasks, including NLP, image and video processing problems.

The original Transformer can be decomposed into two parts, called the encoder and the decoder. As the name suggests, the goal of the encoder is to encode an input sequence in the form of a vector of numbers, a low-level format that is understood by machines. The decoder, on the other hand, takes the encoded sequence and, by applying a language modeling task, generates a new sequence.

Encoders and decoders can be used individually for specific tasks. The two most famous models deriving their parts from the original Transformer are BERT (Bidirectional Encoder Representations from Transformers), consisting of encoder blocks, and GPT (Generative Pre-Trained Transformer), composed of decoder blocks.

[Image: Transformer architecture]

In this article, we will talk about GPT and understand how it works. From a high-level perspective, the GPT architecture consists of a set of Transformer blocks, as illustrated in the diagram above, except that it does not have any input encoders.

As for most LLMs, GPT's framework consists of two stages: pre-training and fine-tuning. Let us study how they are organised.

1. Pre-training

Loss function

As the paper states, "We use a standard language modeling objective to maximize the following likelihood":

L1(U) = Σ_i log P(u_i | u_{i-k}, …, u_{i-1}; Θ) (pre-training loss function)

In this formula, at each step, the model outputs the probability distribution of all possible tokens being the next token i of the sequence, given the last k context tokens. Then, the logarithm of the probability of the real token is calculated and used as one of the terms in the sum above.

The parameter k is called the context window size.

The loss function above is also known as the log-likelihood.

Encoder models (e.g. BERT) predict tokens based on the context from both sides, while decoder models (e.g. GPT) use only the previous context; otherwise, they would not be able to learn to generate text.

[Image: GPT diagram during pre-training]

The intuition behind the loss function

Since the expression for the log-likelihood might not be easy to comprehend, this section explains in detail how it works.

As the name suggests, GPT is a generative model, meaning that its ultimate goal is to generate new sequences during inference. To achieve this, during training an input sequence is embedded and split into several substrings of equal size k. After that, for each substring, the model is asked to predict the next token by generating an output probability distribution (using the final softmax layer) over all vocabulary tokens. Each token in this distribution is mapped to the probability that exactly this token is the true next token in the subsequence.

To make things clearer, let us look at the example below, in which we are given the following string:

[Image: example string]

We split this string into substrings of length k = 3; this windowing step is sketched in the snippet below.
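As a concrete illustration, here is a minimal sketch of that splitting step. It assumes a hypothetical toy sentence (the article's original example string is not reproduced here) and reads each substring as the last k tokens preceding the token to be predicted, matching the formula above.

```python
# Minimal sketch of the windowing step, under the assumption that every
# context is the k tokens preceding the target token, i.e. the model learns
# P(u_i | u_{i-k}, ..., u_{i-1}). The sentence below is a hypothetical
# stand-in for the article's example string.

k = 3  # context window size

tokens = ["the", "cat", "sat", "on", "the", "mat"]  # hypothetical example

for i in range(k, len(tokens)):
    context = tokens[i - k:i]   # substring of length k
    target = tokens[i]          # the true next token the model must predict
    print(context, "->", target)

# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat
```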
For each of these substrings, the model outputs a probability distribution for the language modeling task. The predicted distributions are shown in the table below:

[Image: predicted probability distributions for each substring]

In each distribution, the probability corresponding to the true token in the sequence is taken (highlighted in yellow) and used for the loss calculation. The final loss equals the sum of the logarithms of the true-token probabilities.

GPT tries to maximise this loss, so higher loss values correspond to better algorithm performance.

From the example distributions above, it is clear that high predicted probabilities for the true tokens add larger values to the loss function, demonstrating better performance of the algorithm.

Subtlety behind the loss function

We have understood the intuition behind GPT's pre-training loss function. Nevertheless, the expression for the log-likelihood is actually derived from another formula that is much easier to interpret!

Let us assume that the model performs the same language modeling task. This time, however, the loss function maximizes the product of all predicted probabilities. It is a reasonable choice, since the output probabilities predicted for different subsequences are independent.

[Image: Multiplication of probabilities as the loss value for the previous example, and the computed loss value]

Since each probability is defined in the range [0, 1], this loss function also takes values in that range. The highest value of 1 indicates that the model predicted all the correct tokens with 100% confidence, so it can fully restore the whole sequence. Therefore:

The product of probabilities, used as the loss function for a language modeling task, maximizes the probability of correctly restoring the whole sequence(s).

L(U) = Π_i P(u_i | u_{i-k}, …, u_{i-1}; Θ) (general formula for the product of probabilities in language modeling)

If this loss function is so simple and has such a nice interpretation, why is it not used in GPT and other LLMs? The problem comes down to computational limits:

  • In the formula, a set of probabilities is multiplied. Their values are usually very low and close to 0, especially at the beginning of pre-training, when the algorithm has not learned anything yet and assigns almost random probabilities to tokens.
  • In real life, models are trained in batches rather than on single examples. This means that the total number of probabilities in the loss expression can be very high.

As a consequence, a lot of tiny values are multiplied. Unfortunately, computers with their floating-point arithmetic are not precise enough to compute such expressions. That is why the loss function is slightly transformed by placing a logarithm around the whole product. The reasoning behind this is two useful properties of the logarithm:

  • The logarithm is monotonic. This means that a higher loss still corresponds to better performance and a lower loss to worse performance, so maximizing L or log(L) does not require any modifications to the algorithm.

[Image: Natural logarithm plot]

  • The logarithm of a product equals the sum of the logarithms of its factors, i.e. log(ab) = log(a) + log(b). This rule can be used to decompose the product of probabilities into a sum of logarithms:

log L(U) = Σ_i log P(u_i | u_{i-k}, …, u_{i-1}; Θ)

We can notice that, just by introducing the logarithmic transformation, we have obtained the same formula used for the original loss function in GPT! A small numerical demonstration of why this matters is given in the snippet below.
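The numerical effect of this transformation is easy to see in a minimal sketch, assuming 1,000 hypothetical true-token probabilities of 0.01 each:

```python
# Multiplying many small probabilities underflows 64-bit floating point,
# while summing their logarithms remains perfectly stable.
import math

probs = [0.01] * 1000   # hypothetical predicted probabilities of the true tokens

product = 1.0
for p in probs:
    product *= p
print(product)           # 0.0  (0.01 ** 1000 = 1e-2000 underflows to zero)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)           # about -4605.17, finite and easy to compare and maximize
```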
Given the observations above, we can conclude an important fact:

The log-likelihood loss function in GPT maximizes the logarithm of the probability of correctly predicting all the tokens in the input sequence.

Text generation

Once GPT is pre-trained, it can already be used for text generation. GPT is an autoregressive model, meaning that it uses previously predicted tokens as input for predicting the next ones.

On each iteration, GPT takes an initial sequence and predicts the next most probable token for it. After that, the sequence and the predicted token are concatenated and passed as input to predict the next token again, and so on. The process lasts until the [end] token is predicted or the maximum input size is reached. A minimal version of this loop is sketched in one of the code snippets after this section.

[Image: Autoregressive completion of a sentence with GPT]

2. Fine-tuning

After pre-training, GPT can capture the linguistic knowledge of input sequences. However, to make it perform better on downstream tasks, it needs to be fine-tuned on a supervised problem.

For fine-tuning, GPT accepts a labelled dataset in which each example contains an input sequence x with a corresponding label y that needs to be predicted. Every example is passed through the model, which outputs its hidden representation h at the last layer. The resulting vectors are then passed to an added linear layer with learnable parameters W and then through a softmax layer.

The loss function used for fine-tuning is very similar to the one from the pre-training phase, but this time it evaluates the probability of observing the target value y instead of the next token. Ultimately, the evaluation is done over several examples in a batch, for which the log-likelihood is then calculated.

L2(C) = Σ_{(x, y)} log P(y | x_1, …, x_m) (loss function for the downstream task)

Additionally, the authors of the paper found it useful to include the auxiliary objective used for pre-training in the fine-tuning loss function as well. According to them, it:

  • improves the model's generalization;
  • accelerates convergence.

[Image: GPT diagram during fine-tuning. Image adapted by the author.]

Finally, the fine-tuning loss function takes the following form (α is a weight):

L3(C) = L2(C) + α · L1(C) (fine-tuning loss function)

A sketch of this combined objective also appears in the snippets after this section.

There exist a lot of approaches in NLP for fine-tuning a model. Some of them require changes in…
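The autoregressive loop from the "Text generation" section can be sketched as follows. It assumes a hypothetical callable model(sequence) that returns a dictionary mapping each vocabulary token to its predicted next-token probability; greedy decoding (always taking the most probable token) is used for simplicity.

```python
# Minimal sketch of autoregressive (greedy) generation.
# `model` is a hypothetical placeholder: model(sequence) -> {token: probability}.
MAX_LENGTH = 50
END_TOKEN = "[end]"

def generate(model, prompt_tokens):
    sequence = list(prompt_tokens)
    while len(sequence) < MAX_LENGTH:
        distribution = model(sequence)                         # P(next token | sequence so far)
        next_token = max(distribution, key=distribution.get)   # most probable token
        if next_token == END_TOKEN:                            # stop when [end] is predicted
            break
        sequence.append(next_token)                            # feed the prediction back in
    return sequence
```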
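A minimal PyTorch-style sketch of the combined fine-tuning objective is given below. It assumes a hypothetical backbone that returns the last-layer hidden representation h for each example and a hypothetical lm_loss implementing the auxiliary language modeling objective; names such as hidden_size, num_classes and alpha are illustrative. Minimizing the cross-entropy here is equivalent to maximizing the log-likelihood described above.

```python
# Minimal sketch of the fine-tuning loss L3 = L2 + alpha * L1, under the
# assumptions stated above. `backbone` and `lm_loss` are hypothetical.
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_classes = 768, 2            # illustrative dimensions
head = nn.Linear(hidden_size, num_classes)   # the added linear layer with parameters W
alpha = 0.5                                  # weight of the auxiliary pre-training objective

def fine_tuning_loss(backbone, lm_loss, inputs, labels):
    h = backbone(inputs)                     # last-layer hidden representations, shape (batch, hidden_size)
    logits = head(h)                         # linear layer; softmax is folded into cross_entropy below
    task_loss = F.cross_entropy(logits, labels)    # negative log P(y | x), averaged over the batch
    return task_loss + alpha * lm_loss(inputs)     # combined objective to minimize
```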



Source link
